Add support for more file types + archlinux packaging #9
qrexec really just provides a pipe-like connection between two processes running in different VMs. You can emulate it locally with
I don't like this added interactive part of the protocol. Why not simply display a password prompt from within the DispVM (server part)?
Ran into another issue: So multiple options:
Just downloading the latest unoconv version from GitHub (https://github.com/unoconv/unoconv/blob/master/unoconv) and running it on Debian buster works as expected. (I also played a bit with a LibreOffice macro to remove the password; in fact, I found nearly all of that code on the internet but didn't save the original URL.)
That's me! :D I didn't know about Dangerzone when I started but you've certainly put me on to them now! Looks like the major differences are OCR, compressed PDF size, and multiple file types. Right now, I'm just trying to stabilize my PR before adding any more new features. But, if you'd like, I certainly wouldn't mind some extra help on implementing those things afterwards. It seems you already have the multiple file types down, and porting them to Python wouldn't be too hard (he said hopefully).
Hi,
From the links you pointed out, it looks like OCR (well, Tesseract anyways) doesn't work on raw RGB bitmaps but more finalized image formats like PNG. Idk where Dangerzone does its OCR but we create PNGs client-side, meaning OCR would occur on the client as well, which is assumed to be safe. The main things we need to worry about are what the server sends over: page count, image dimensions, and RGB bitmaps. Valid but incorrect submissions of these would be a pain to deal with.
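The client-side checks on what the server sends (page count, image dimensions, RGB bitmaps) could be sketched roughly as follows. This is only an illustration, not the project's code: the limits `MAX_PAGES` and `MAX_DIMENSION`, the "WIDTH HEIGHT" text format, and all function names are hypothetical. The only format assumption is that a raw RGB bitmap uses 3 bytes per pixel.

```python
from typing import Tuple

# Hypothetical limits; a real client would pick its own.
MAX_PAGES = 10_000
MAX_DIMENSION = 10_000  # pixels per side


def _parse_uint(raw: str) -> int:
    """Accept only plain ASCII decimal digits (no sign, spaces, or Unicode digits)."""
    if not (raw.isascii() and raw.isdigit()):
        raise ValueError(f"not a plain unsigned integer: {raw!r}")
    return int(raw)


def validate_page_count(raw: str) -> int:
    """Parse the page count reported by the server and bound it."""
    count = _parse_uint(raw)
    if not 1 <= count <= MAX_PAGES:
        raise ValueError(f"page count out of range: {count}")
    return count


def validate_dimensions(raw: str) -> Tuple[int, int]:
    """Parse a 'WIDTH HEIGHT' string (assumed format) and bound both values."""
    parts = raw.split(" ")
    if len(parts) != 2:
        raise ValueError(f"expected 'WIDTH HEIGHT', got {raw!r}")
    width, height = _parse_uint(parts[0]), _parse_uint(parts[1])
    for value in (width, height):
        if not 1 <= value <= MAX_DIMENSION:
            raise ValueError(f"dimension out of range: {value}")
    return width, height


def expected_rgb_size(width: int, height: int) -> int:
    """Number of bytes in a raw RGB bitmap: 3 bytes per pixel."""
    return width * height * 3


def validate_bitmap(data: bytes, width: int, height: int) -> None:
    """Reject a bitmap whose length does not match the announced dimensions."""
    if len(data) != expected_rgb_size(width, height):
        raise ValueError("bitmap size does not match the announced dimensions")
```

Checks like these catch malformed values, but not the "valid but incorrect" case mentioned above: a server can still announce a plausible page count or dimensions that simply do not match the document.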
Some updates about Debian: I tried the following workaround (it seems to work, but I haven't tested it on all templates yet):
libreoffice --accept='socket,host=localhost,port=2202;urp;' --norestore --nologo --nodefault >/dev/null 2>/dev/null &
listener_notready=1
# Wait until the libreoffice server is started
while [ $listener_notready -ne 0 ];
do
    sleep 1
    netstat -anop 2> /dev/null | grep '127.0.0.1:2202' | grep LISTEN >/dev/null 2>/dev/null
    listener_notready=$?
done
# Remove password from file using libreoffice API
# ...
python3 -c '
import os
import uno
from com.sun.star.beans import PropertyValue
import sys
src="file://'"$INPUT_FILE"'"
dst="file://'"$INPUT_FILE.nopassword"'"
password="'"$PASSWORD"'"
localContext = uno.getComponentContext()
resolver = localContext.ServiceManager.createInstanceWithContext("com.sun.star.bridge.UnoUrlResolver", localContext)
ctx = resolver.resolve("uno:socket,host=localhost,port=2202;urp;StarOffice.ComponentContext")
smgr = ctx.ServiceManager
desktop = smgr.createInstanceWithContext("com.sun.star.frame.Desktop", ctx)
hidden_property = PropertyValue()
hidden_property.Name = "Hidden"
hidden_property.Value = True
password_property = PropertyValue()
password_property.Name = "Password"
password_property.Value = password
document = desktop.loadComponentFromURL(src, "_blank", 0, (password_property, hidden_property,))
document.storeAsURL(dst, ())' >&2
# ...
libreoffice --convert-to pdf "$INPUT_FILE.nopassword" --outdir /tmp/ >&2
mv "$INPUT_FILE"".pdf" "$INPUT_FILE"
It removes the unoconv dependency, but it adds around 40 lines of code.
Update: the workaround seems to work on all templates I tested (Arch Linux, Debian buster, and Fedora 32).
TODO:
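Since the script is meant to be ported to Python, the "wait until the listener is ready" loop could be written without netstat by simply trying to connect. A minimal sketch, assuming the LibreOffice listener tolerates a connection that is opened and closed immediately (not verified here); the function name and the timeout values are hypothetical:

```python
import socket
import time


def wait_for_listener(host: str, port: int, timeout: float = 30.0,
                      interval: float = 0.5) -> bool:
    """Return True once host:port accepts TCP connections, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Not listening yet (e.g. connection refused): retry shortly.
            time.sleep(interval)
    return False
```

With the port used in the script above, the call would be `wait_for_listener("localhost", 2202)`. Unlike the shell loop, this gives up after `timeout` seconds instead of waiting forever.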
Pretty much all I wanted to add is done. Putting this pull request on hold until the port from bash to Python is done and accepted. After that I will need to integrate these changes into the Python port, but that shouldn't be too hard.
Update: since the bash -> Python rewrite is over, I started to rewrite this pull request in Python.
@Bl0nd could you do a first review of the Python code? :) There is probably a lot of room for improvement.
I don't have Qubes installed right now so this is just from me looking over the code a bit.
In addition to my comments, here are some general thoughts:
- As it stands, I'd rather not support password-protected PDFs. The main issues with this implementation are that passwords are specified as command-line arguments, you may only specify 1 password even though multiple files may have different passwords, and _decrypt() is ironically incredibly cryptic relative to the rest of the codebase.
- It seems to me that --gui is just for password prompts? If so, I think it's a bit unnecessary. If password-protected PDFs were supported, GUI prompts are basically the only option considering that multiple files can be processed. Plus, GUI prompts shouldn't be created from the server (for security reasons), but from requests made to Dom0 which always provides a GUI.
- Having to use raw sockets for multiple file type support... Is this really the only way? 😅
Note that I didn't review much of server.py. That's mainly due to the thoughts I listed above.
qubespdfconverter/server.py
Outdated
###########################
# The project "Dangerzone" reused the idea of this script based on:
# https://blog.invisiblethings.org/2013/02/21/converting-untrusted-pdfs-into-trusted.html
#
# - https://github.com/firstlookmedia/dangerzone-converter
# - https://dangerzone.rocks/
# - https://github.com/firstlookmedia/dangerzone
#
# Dangerzone try to export the idea to non Qubes based system, and try to improve it.
# Both projects can improve the other.
###########################
I don't think we need this.
Indeed, or at least not in this way.
My intent was to tell a potential contributor: "Hey, a very similar project exists, and maybe we can take some good ideas from it and re-implement them here".
IDK whether I:
- should not do that at all
- should do that, but far less verbosely and/or somewhere else
Replaced that with a smaller comment: https://github.com/neowutran/qubes-app-linux-pdf-converter/blob/master/qubespdfconverter/server.py#L23
Thank you very much for the review! :)
About "_decrypt()", I sadly agree. I don't know how to improve this part of the code, since it is mostly LibreOffice API calls.
Yes, "--gui" is just for password prompts. The original intent was to have two ways of using this tool: the "standard" way, and as a batch process (so without any GUI or blocking prompt). However, after reading your comments, it seems that it adds more complexity than it is worth. I will still keep the password-protected file support, though: I see more and more phishing campaigns using password-protected office files containing malicious macros / commands, so I (and probably others) need that feature.
Can you explain more about that?
Yeah, not particularly elegant...
As @neowutran already said, having the password prompt on the server side should be fine. This way, everything that interacts with the potentially malicious PDF file is isolated in that VM. Plus, it avoids a protocol change (the server prompts for the password itself, so there is no need to receive it from the client). The cost is that the password prompt is always interactive (no way to script it), but in my opinion that isn't a big issue.
There is something wrong with the commits here: a lot of conflicts, most likely because commits for the old version are still present. I think the easiest way to fix it is to forcefully rebase the whole thing:
Generally I would recommend creating a separate branch for pull requests (and for changes in general) instead of using master in your repo, but that's an unrelated topic. PS: I haven't looked at the changes yet, only at the comments here.
Oh, I want password-protected file support too, I just meant not with this specific implementation.
I guess I just thought untrusted VMs shouldn't draw prompts or something. Marek said it was fine though so forget about that part.
Things that don't work yet:
I think it is ready for another review. Things that changed since last time:
(tested on Arch Linux, fc32 and buster)
CI build still doesn't work: Python still doesn't find the LibreOffice UNO lib, no idea why.
@Bl0nd if you want to review it once more :)
Been a bit busy with some other stuff lately, sorry about that. I'll try to review it as soon as I can.
Thanks for the review @Bl0nd!
Nice work! Btw, could you resolve the conversations you've dealt with? It'd make it easier to see what still needs to be worked on.
Done, 2 unresolved conversations. On a side note, Travis CI is still not working: Python doesn't find the LibreOffice dependency needed to resolve "uno". For the moment I have no idea why, but it is probably better to keep that for the end. From my understanding, it doesn't work because: Update:
I think first you can try to solve the Makefile conflict (rebase) then @marmarek?
…ge nothing except that pylint stop complaining. See https://www.openoffice.org/udk/python/python-bridge.html#mapping
It seems that the new version of LibreOffice (>= 7.0) on Arch Linux introduced a bug/regression in the "storeAsURL" API for some file types (.docx). So this is WIP again while I look for a workaround / open a bug ticket.
I was unable to find an explanation of why the "decrypt" function stopped working with LibreOffice >= 7, so I filed a bug ticket: https://bugs.documentfoundation.org/show_bug.cgi?id=137926
Anyway, it is ready for review. We could instruct Arch Linux users to select "libreoffice-still" (instead of "libreoffice-fresh") until the LibreOffice team fixes the issue.
An idea for a workaround for the LibreOffice bug (I found the motivation neither to [learn C++ and the LibreOffice codebase] nor to [recompile and test LibreOffice to find the faulty commit between 6.4 and 7.0]):
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE script:module PUBLIC "-//OpenOffice.org//DTD OfficeDocument 1.0//EN" "module.dtd">
<script:module xmlns:script="http://openoffice.org/2000/script" script:name="Module1" script:language="StarBasic">
REM ***** BASIC *****
Sub Main
    dim properties(1) as new com.sun.star.beans.PropertyValue
    url = convertToURL("/home/user/qubes-app-linux-pdf-converter/tests/files_success/doc.doc")
    properties(0).Name = "Password"
    properties(0).Value = "toor"
    properties(1).Name = "Hidden"
    properties(1).Value = True
    doc = StarDesktop.loadComponentFromUrl(url, "_blank", 0, properties())
    dim properties2(0) as new com.sun.star.beans.PropertyValue
    doc.storeAsURL("file:////home/user/qubes-app-linux-pdf-converter/tests/files_success/tata.doc", properties2())
    StarDesktop.Terminate()
End Sub
</script:module>
Then call LibreOffice:
soffice --nologo --norestore --nodefault --headless 'macro:///Standard.Module1.Main'
…able to libreoffice when launching the service, it is possible that this modification won't be required with the libreoffice version provided by the distribution, but for the moment I need to apply a patch for libreoffice
Found the commit that broke the behavior that I was relying on. Also, since I now understand this change of behavior, I wrote a workaround for this issue.
Main "features" of this pull request
Minor improvements
Things that need improvement / TODO:
How to quickly test the software while developing it?
I created a new directory called "dev_tools"; it contains some safe test files and a script to automatically convert them.
Fix Debian and Fedora packaging
In the client script I added code that reads the PDF file to get its size:
wc -c "$INPUT_FILE" | cut -d ' ' -f 1
I assumed that counting the number of bytes in a file is an operation simple enough to be bug-free.
This line is part of the changes I made to support password-protected files from the GUI. The server script tells the client script "Hey! It needs a password!" or "Hey! All fine, it doesn't need a password!", and depending on that, the GUI shows a password form. So I needed to avoid closing the file descriptor
exec >&-
to be able to keep communicating between client and server after the file has been transferred. I didn't find a way to avoid this byte counting on the file.
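The idea of announcing the file size so that the stream can stay open afterwards can be sketched in Python as length-prefixed framing. This is an illustration under assumptions, not the PR's wire format: the PR sends the size produced by `wc -c` as text, whereas this sketch uses a hypothetical fixed 8-byte big-endian prefix, and all function names are invented for the example.

```python
import struct
from typing import BinaryIO

# Hypothetical header: file size as an unsigned 64-bit big-endian integer.
HEADER = struct.Struct(">Q")


def send_file(stream: BinaryIO, data: bytes) -> None:
    """Write the size first, then the bytes; the stream is left open."""
    stream.write(HEADER.pack(len(data)))
    stream.write(data)
    stream.flush()


def read_exact(stream: BinaryIO, size: int) -> bytes:
    """Read exactly `size` bytes, or raise EOFError if the stream ends early."""
    chunks = []
    remaining = size
    while remaining > 0:
        chunk = stream.read(remaining)
        if not chunk:
            raise EOFError(f"stream ended with {remaining} bytes missing")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def receive_file(stream: BinaryIO) -> bytes:
    """Read the size header, then exactly that many bytes of file content."""
    (size,) = HEADER.unpack(read_exact(stream, HEADER.size))
    return read_exact(stream, size)
```

Because the receiver knows where the file ends without waiting for end-of-file, the same stream can carry later messages, such as a "password needed" reply. A real receiver would presumably also cap the announced size before reading.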
Why this pull request
I found this project (https://dangerzone.rocks, https://github.com/firstlookmedia/dangerzone, https://github.com/firstlookmedia/dangerzone-converter) from Micahflee, who implemented the idea for non-Qubes-based systems and improved it to support more file types. So I am trying to bring it back to Qubes and improve it a bit further, to support password-protected files and allow some more file types (still only the files supported by LibreOffice, but I am much more permissive with the MIME type check).
Tests I did
The things in "dev_tools"
Tested it on Arch Linux, with GUI and with CLI.
Haven't tested it on Debian/Fedora yet.
I see there is a pull request to re-implement the script from bash to Python.
So this pull request will require modifications once the bash -> Python port is done.